Abstract:- Despite considerable research effort in the field of natural language processing, a number of
problems remain, such as ambiguity, limited coverage and lack of relative importance, all of which reduce
processing accuracy. To reduce these problems and increase accuracy we propose "EQUIRS: Explicitly Query
Understanding Information Retrieval System based on HMM". In this framework, we use a Hidden Markov
Model (HMM) to improve the accuracy of the results and to resolve ambiguity efficiently. Various models
have previously been used to improve the accuracy of text queries; one of the most widely chosen is the
fuzzy clustering method, but it fails to reduce the limited coverage problem. To reduce this problem and
improve accuracy, we apply EQUIRS based on HMM and compare its results with those of fuzzy clustering
techniques.

In the proposed framework, 900 files are first used for training. They are divided into five file class
categories, called the five query view clusters (organization, topic, exchange, place, people). The HMM then
finds the nearest probability distance to the submitted text query using the QPU (Query Process Unit) and
returns suggestions based on emission probability (suggestion depth 5) that are similar to the query view.
The proposed approach thus differs from fuzzy-based learning, which has lower accuracy, and is evaluated
using the clustering measures Precision, Recall, F-Measure, Training Time and Searching Time.

I. INTRODUCTION

Natural language processing (NLP) is becoming one of the most active areas in human-computer interaction.
The goal of NLP is to enable communication between people and computers without resorting to memorization
of complex commands and procedures. In other words, NLP is a set of techniques that make the computer
understand the languages naturally used by humans. While natural language may be the easiest symbol system
for people to learn and use, it has proved to be the hardest for a computer to master. Despite the challenges,
natural language processing is widely regarded as a promising and critically important endeavor in the field of
computer research. The general goal for most computational linguists is to give the computer the ability to
understand and generate natural language, so that eventually people can address their computers through text as
though they were addressing another person. The applications that will be possible when NLP capabilities are
fully realized are impressive: computers would be able to process natural language, translate languages
accurately and in real time, or extract and summarize information from a variety of data sources,
depending on the users' requests.

A hidden Markov model (HMM) [12] is a statistical Markov model in which the system being modeled
is assumed to be a Markov process with unobserved (hidden) states. An HMM can be considered the simplest
dynamic Bayesian network.

Figure 1.1:- Hidden Markov Model
Probabilistic parameters of an HMM (example).
x — states 

y — possible observations 

a — state transition probabilities 

b — output probabilities 

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition
probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but the
output, which depends on the state, is visible. Each state has a probability distribution over the possible
output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence
of states. Note that the adjective 'hidden' refers to the state sequence through which the model passes, not to
the parameters of the model; even if the model parameters are known exactly, the model is still 'hidden'.
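
The distinction between the hidden state sequence and the visible output tokens can be illustrated with a minimal sketch. The two states, the three-word vocabulary and all probabilities below are invented for illustration and are not taken from the EQUIRS system.

```python
import random

# Hypothetical two-state model; all names and numbers are illustrative only.
STATES = ["topic", "place"]
START = {"topic": 0.6, "place": 0.4}
TRANS = {
    "topic": {"topic": 0.7, "place": 0.3},
    "place": {"topic": 0.4, "place": 0.6},
}
EMIT = {
    "topic": {"trade": 0.5, "oil": 0.4, "london": 0.1},
    "place": {"trade": 0.1, "oil": 0.2, "london": 0.7},
}


def draw(dist, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def sample(length, seed=0):
    """Return (hidden_states, observations) for a sequence of the given length."""
    rng = random.Random(seed)
    states, outputs = [], []
    state = draw(START, rng)
    for _ in range(length):
        states.append(state)
        outputs.append(draw(EMIT[state], rng))  # output depends on the state
        state = draw(TRANS[state], rng)         # next state depends on the current one
    return states, outputs


hidden, visible = sample(5)
print("hidden :", hidden)   # not available to an observer
print("visible:", visible)  # the only sequence an observer sees
```

Even though START, TRANS and EMIT are fully known here, an observer who receives only the visible sequence still has to infer the hidden one.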
In this paper, we evaluate the training time, searching time and accuracy of the proposed algorithm. To
measure these performance parameters we use a transaction data set containing five file classes, taken from
the Reuters-21578 text categorization test collection, Distribution 1.0 (README file) [13].
The experimental results show that the proposed algorithm retrieves information from a large dataset with high
accuracy, at the cost of more training time and more searching time. The main purpose of the proposed
algorithm is to improve precision, recall and accuracy.

II. BACKGROUND 

Hidden Markov Models (HMM) [1] can also be used for classifying patterns from an unknown dataset.
For example, in the speech literature HMMs have been used for classifying speakers [2-3] or speech patterns
[4, 5]. Typically, for pattern classification, a number of HMMs are used in combination with supervised
techniques. In this paper, we propose an algorithm, EQUIRS, based on HMM. In our model, a single HMM is
used to identify the number of sequences and states in a given dataset. The data items are then labeled and
partitioned into the appropriate five file indexes. Initially, the HMM is used to calculate the emission
probability for each of the data items. Here, the emission probability on the one hand represents how well the
data fit the trained HMM and on the other provides a similarity measure between data items. While Hidden
Markov Models have not been employed in web query classification, they have been extensively studied and
applied in document classification [9], text categorization of multi-page documents [11], recognition of
facial expressions from video sequences [8], the well-known HMM part-of-speech tagger [7] and speech
recognition [10]. While Cohen et al. used the temporal facial expressions as the HMM states, speech
recognition involves the phone symbols as the observation sequence [10]. Hidden Markov Models were first
introduced in the 1970s as a tool for speech recognition [6]. Recently, the popularity of HMMs has increased
in the pattern recognition domain, primarily because of their strong mathematical basis and their ability to
adapt to unknown data. This section describes HMMs in more detail, together with a description of the
algorithms used to induce them. Further details can be found in [1].

The Hidden Markov Model (HMM) is a variant of a finite state machine having a set of hidden states, $Q$, an
output alphabet (observations), $O$, transition probabilities, $A$, output (emission) probabilities, $B$, and
initial state probabilities, $\Pi$. The current state is not observable. Instead, each state produces an output
with a certain probability ($B$). Usually the states, $Q$, and outputs, $O$, are understood, so an HMM is said
to be a triple,

$(A, B, \Pi)$.

Mathematical Definition:

Hidden states $Q = \{ q_i \}$, $i = 1, \dots, N$.

Transition probabilities $A = \{ a_{ij} = P(q_j \text{ at } t+1 \mid q_i \text{ at } t) \}$, where $P(a \mid b)$
is the conditional probability of $a$ given $b$, $t = 1, \dots, T$ is time, and $q_i \in Q$. Informally, $A$ is
the probability that the next state is $q_j$ given that the current state is $q_i$.

Observations (symbols) $O = \{ o_k \}$, $k = 1, \dots, M$.

Emission probabilities $B = \{ b_{ik} = b_i(o_k) = P(o_k \mid q_i) \}$, where $o_k \in O$. Informally, $B$ is
the probability that the output is $o_k$ given that the current state is $q_i$.

Initial state probabilities $\Pi = \{ p_i = P(q_i \text{ at } t = 1) \}$.
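
As an illustration of how the triple of transition, emission and initial state probabilities is used, the following minimal sketch computes the probability of an observation sequence with the standard forward algorithm. The function name and the numeric values are assumptions made for the example; the paper does not state which HMM routine EQUIRS calls.

```python
def forward_likelihood(A, B, Pi, obs):
    """P(obs | A, B, Pi) computed with the forward algorithm.

    A[i][j] = P(q_j at t+1 | q_i at t)
    B[i][k] = P(o_k | q_i)
    Pi[i]   = P(q_i at t = 1)
    obs is a non-empty list of observation indices k.
    """
    n = len(Pi)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [Pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: move to state j, then emit the next symbol.
    for k in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][k]
            for j in range(n)
        ]
    # Termination: sum over all possible final states.
    return sum(alpha)


# Illustrative numbers only (N = 2 states, M = 2 symbols).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
Pi = [0.5, 0.5]
print(forward_likelihood(A, B, Pi, [0, 1]))
```

Because each row of A and B and the vector Pi sum to one, the likelihoods of all observation sequences of a fixed length also sum to one, which is a convenient sanity check.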

III. PROPOSED WORK 

In this paper, we propose a new natural language processing model for a text query information
retrieval system, called EQUIRS: Explicitly Query Understanding Information Retrieval System based on
HMM. HMM-based methods have significant theoretical advantages and have shown impressive performance in
many tasks, such as text categorization on test collection databases. The goal of text query understanding
and probabilistic automatic information retrieval is to compare the input text query vector with all the
classes and then reach a decision that identifies the class to which the input text query vector belongs, or
determines that it does not belong to the database at all. In this research work, text query understanding
is studied as a problem of ambiguity and lack of knowledge. Our proposed model is designed to tackle these
problems.

Figure 1.2:- EQUIRS based on HMM architecture

Proposed Algorithm

Input - Training data set D (N is the total number of files; n is the total number of training files),
Input_query (1 to x), min_sug (suggestion_depth)

Output - Performance and Comparison.

Training

Step 1:- Read all file data sets D (1 to N).

Step 2:- Convert all N data files into a class vector matrix and save it in the matrix Vectors.

Step 3:- Apply the hidden Markov model to the vectors from Step 2.

Step 4:- From the result of Step 3, calculate the emission probability matrix (EMIS).

Step 5:- Store EMIS and Vectors.
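
The training steps above can be sketched as follows. This is only one possible reading, under loudly stated assumptions: the five file classes are treated as the hidden states, words are the observations, and EMIS is estimated by add-one smoothed word counts. The paper does not specify the vectorization or the estimation procedure, so the function name `train` and the toy files are hypothetical.

```python
from collections import Counter

CLASSES = ["organization", "topic", "exchange", "place", "people"]


def train(files):
    """files: list of (class_name, text) pairs. Returns (vocab, emis).

    emis[c][w] is a smoothed estimate of P(word w | class c).
    Assumption: classes play the role of HMM states.
    """
    counts = {c: Counter() for c in CLASSES}
    for label, text in files:                        # Step 1: read the files
        counts[label].update(text.lower().split())   # Step 2: per-class word vectors
    vocab = sorted({w for c in CLASSES for w in counts[c]})
    emis = {}
    for c in CLASSES:                                # Steps 3-4: emission matrix
        total = sum(counts[c].values()) + len(vocab)  # add-one smoothing
        emis[c] = {w: (counts[c][w] + 1) / total for w in vocab}
    return vocab, emis                               # Step 5: caller stores both


# Toy stand-in for the training files; this is not the Reuters data.
toy = [
    ("place", "london tokyo london"),
    ("people", "thatcher reagan"),
    ("topic", "oil trade oil"),
]
vocab, EMIS = train(toy)
print(EMIS["place"]["london"], EMIS["place"]["oil"])
```

Smoothing keeps every emission probability above zero, so a class is not ruled out entirely by one word it never saw during training.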



Testing

Step 1:- Read the input query (length 1 to x).

Step 2:- Convert the input query into a vector.

Step 3:- Load the transmission vector.

Step 4:- Add the query vector to the training vectors.

Step 5:- Use the hidden Markov model to calculate the emission probability matrix.

Step 6:- Find the most similar entries in the matrix from Step 5.

Step 7:- Calculate the vector of similar entries in the EMIS matrix.

Step 8:- Convert the vector into a string and display it.

Step 9:- End.
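
A minimal sketch of the testing steps is given below, assuming that a stored emission matrix maps each class to word probabilities and that similarity is measured by the log-probability of the query words under each class. The matrix contents, the `UNSEEN` floor value and the function names are hypothetical; only the suggestion depth parameter `min_sug` comes from the algorithm's input list.

```python
import math

# Hypothetical stored emission matrix: EMIS[class][word] = P(word | class).
EMIS = {
    "place":  {"london": 0.5, "tokyo": 0.3, "oil": 0.1, "trade": 0.1},
    "topic":  {"london": 0.1, "tokyo": 0.1, "oil": 0.5, "trade": 0.3},
    "people": {"london": 0.25, "tokyo": 0.25, "oil": 0.25, "trade": 0.25},
}
UNSEEN = 1e-6  # assumed floor probability for words outside the vocabulary


def score(query, emis=EMIS):
    """Log-probability of the query words under each class (Steps 1-5)."""
    words = query.lower().split()
    return {
        c: sum(math.log(probs.get(w, UNSEEN)) for w in words)
        for c, probs in emis.items()
    }


def suggest(query, min_sug=5, emis=EMIS):
    """Best-matching class and up to min_sug of its likeliest words (Steps 6-8)."""
    scores = score(query, emis)
    best = max(scores, key=scores.get)
    ranked = sorted(emis[best], key=emis[best].get, reverse=True)
    return best, ranked[:min_sug]


print(suggest("oil trade"))
```

Working in log space avoids numerical underflow when a query contains many words.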



IV. RESULT 

Performance Parameters 

We measure the performance of our algorithm in terms of the following parameters:
Training Time

Training time is the total time required to train the algorithm. Here we compare the training time
of the fuzzy cluster model with that of EQUIRS based on HMM. The previous fuzzy-based model uses the
k-means algorithm to divide knowledge into clusters and therefore requires less training time, i.e.
$\log_2 n$. Our proposed approach takes more time to train than the fuzzy model because it uses an HMM, in
which many state sequences are generated; training takes approximately $O(\log_2 n)$ time.

Searching Time 

Searching time is the total amount of time required to find or retrieve a result.
It is an important measure of the efficiency of any algorithm and should be kept to a minimum. However,
our algorithm takes more searching time than the fuzzy-based model: roughly $O(\log_2 n)$ time, which is
equivalent to the complexity of binary search.

For classification tasks [15], the terms true positives, true negatives, false positives, and false negatives
compare the results of the classifier under test with trusted external judgments. The terms positive and
negative refer to the classifier's prediction (sometimes known as the expectation), and the terms true and
false refer to whether that prediction corresponds to the external judgment (sometimes known as the
observation).
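
The four terms can be made concrete with a small sketch that counts them from a list of classifier predictions and a parallel list of trusted judgments. The function name and the sample data are illustrative assumptions.

```python
def confusion_counts(predicted, trusted):
    """Count (tp, fp, fn, tn) from two parallel lists of booleans.

    predicted[i] is the classifier's decision for item i;
    trusted[i] is the external judgment for the same item.
    """
    tp = fp = fn = tn = 0
    for p, t in zip(predicted, trusted):
        if p and t:
            tp += 1      # predicted positive, judged positive
        elif p and not t:
            fp += 1      # predicted positive, judged negative
        elif not p and t:
            fn += 1      # predicted negative, judged positive
        else:
            tn += 1      # predicted negative, judged negative
    return tp, fp, fn, tn


predicted = [True, True, False, True, False]
trusted = [True, False, False, True, True]
print(confusion_counts(predicted, trusted))
```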
Precision

In our proposed approach, EQUIRS based on HMM calculates the precision of information retrieval.
Precision is the fraction of retrieved documents that are relevant to the search:

$\text{Precision} = \frac{tp}{tp + fp}$

Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off
rank, considering only the topmost results returned by the system. This measure is called precision at n.
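
As a short illustration, precision and precision at n can be computed from a ranked result list and a set of relevant documents. The document identifiers below are invented for the example.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


def precision_at(n, ranked, relevant):
    """Precision over only the top n entries of a ranked result list."""
    return precision(ranked[:n], relevant)


ranked = ["d3", "d7", "d1", "d9", "d4"]   # hypothetical ranked results
relevant = {"d1", "d3", "d4", "d8"}       # hypothetical relevance judgments
print(precision(ranked, relevant), precision_at(2, ranked, relevant))
```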

Recall

Recall in information retrieval is the fraction of the documents that are relevant to the query that are
successfully retrieved.

$\text{Recall} = \frac{tp}{tp + fn}$

For example, for a text search on a set of documents, recall is the number of correct results divided by the
number of results that should have been returned.
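
A corresponding sketch for recall, again with invented document identifiers:

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved) & relevant)
    return hits / len(relevant)


retrieved = ["d3", "d7", "d1", "d9", "d4"]   # hypothetical results
relevant = {"d1", "d3", "d4", "d8"}          # hypothetical relevance judgments
print(recall(retrieved, relevant))
```

Here three of the four relevant documents appear among the results; the missing one, d8, is a false negative.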



F_measure

A measure that combines precision and recall is the harmonic mean of precision and recall, the
traditional F-measure or balanced F-score:

$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

This is also known as the $F_1$ measure, because recall and precision are evenly weighted.
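
The balanced F-score can be computed directly from the counts of true positives, false positives and false negatives, as in the following sketch. The counts used in the call are illustrative and are not results from the experiments.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: precision = 3/5, recall = 3/4.
print(f_measure(tp=3, fp=2, fn=1))
```

The harmonic mean stays close to the smaller of the two values, so a system cannot obtain a high F-score by maximizing only precision or only recall.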

Graph 1.1: Graph shows the difference in training time between the previous fuzzy clustering approach
and our proposed approach, EQUIRS based on HMM.

Graph 1.2: Graph shows the difference in searching time between the previous fuzzy clustering approach
and our proposed approach, EQUIRS based on HMM.

Graph 1.3: Graph shows the difference in precision between the previous fuzzy clustering approach
and our proposed approach, EQUIRS based on HMM.

Graph 1.4: Graph shows the difference in recall between the previous fuzzy clustering approach
and our proposed approach, EQUIRS based on HMM.

Graph 1.5: Graph shows the difference in F_measure between the previous fuzzy clustering approach
and our proposed approach, EQUIRS based on HMM.

The graphs above compare the results generated by our proposed approach (EQUIRS: Explicitly Query
Understanding Information Retrieval System based on HMM) with those of the previous method, which
is based on FCM [14]. The graphs cover training time, searching time, precision, recall and F_measure.
Training time and searching time are higher than for the previous approach. The blue line is our EQUIRS
approach based on HMM, which takes more time to compute the result, and the red line indicates the fuzzy
cluster model approach, which takes less training time and searching time. Because of the algorithm used,
the proposed approach also takes less memory, since the HMM calculates the emission probability from the
current state and not from previous states. Graphs 1.3, 1.4 and 1.5 compare the accuracy of the previous
fuzzy clustering model with that of the proposed EQUIRS approach and give a clear indication of efficient
accuracy (94%).

Table 1:- Results of both FCM and the proposed EQUIRS based on HMM approach for different queries.



Table 1 shows the result comparison of fuzzy cluster match (FCM) and hidden Markov model
(HMM) in terms of precision (P), recall (R), F_measure (F), training time (TT) and
searching time (ST). We took different text queries, generated results with both approaches and
recorded the precision, recall, F_measure, training time and searching time. As shown in the table, for
different numbers of text queries the training time and searching time of our proposed approach, EQUIRS,
are greater than those of the previous approach, FCM. In terms of accuracy, however, our proposed approach
based on HMM is more effective than the FCM approach.

V. FUTURE WORK 

In this work we proposed an information retrieval system. We use emission probabilities based on
the likelihood of state sequences in a Hidden Markov Model (HMM). Experiments on textual queries in
multiple domains show that the proposed approach can significantly improve performance on the clustering
measures of accuracy (Precision, Recall and F_measure), at the cost of higher training time and searching time.

Potentially, the same approach can be applied to spoken queries, given reliable speech
recognition. In future work we will apply the lexicon modeling approach to larger datasets, and we will also
explore the use of other external resources, such as Wikipedia, for automatic learning.